{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocessing\n",
    "\n",
    "This notebook will give you a taste of what scikit-learn provides for preprocessing data.\n",
    "\n",
    "## Data used\n",
    "We will be using the planets data and red wine data:\n",
    "\n",
    "### Data License for Planet Data\n",
    "Copyright (C) 2012 Hanno Rein\n",
    "\n",
    "Permission is hereby granted, free of charge, to any person obtaining a copy of this database and associated scripts (the \"Database\"), to deal in the Database without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Database, and to permit persons to whom the Database is furnished to do so, subject to the following conditions:\n",
    "\n",
    "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Database. A reference to the Database shall be included in all scientific publications that make use of the Database.\n",
    "\n",
    "THE DATABASE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATABASE OR THE USE OR OTHER DEALINGS IN THE DATABASE.\n",
    "\n",
    "### Citations for Red Wine Data\n",
    "P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. \n",
    "Modeling wine preferences by data mining from physicochemical properties.\n",
    "In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.\n",
    "\n",
    "Available at:\n",
    "- [@Elsevier](http://dx.doi.org/10.1016/j.dss.2009.05.016)\n",
    "- [Pre-press (pdf)](http://www3.dsi.uminho.pt/pcortez/winequality09.pdf)\n",
    "- [bib](http://www3.dsi.uminho.pt/pcortez/dss09.bib)\n",
    "\n",
    "Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.\n",
    "\n",
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "planets = pd.read_csv('data/planets.csv')\n",
    "red_wine = pd.read_csv('data/winequality-red.csv')\n",
    "wine = pd.concat([\n",
    "    pd.read_csv('data/winequality-white.csv', sep=';').assign(kind='white'), \n",
    "    red_wine.assign(kind='red')\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train Test Split\n",
    "We can use scikit-learn's `train_test_split()` function to get our training and testing sets. (We will discuss the validation set in [chapter 10](https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/blob/master/ch_10).)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X = planets[['eccentricity', 'semimajoraxis', 'mass']]\n",
    "y = planets.period\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.25, random_state=0\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The original data had this shape:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((3814, 3), (3814,))"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.shape, y.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our training data has this shape:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((2860, 3), (2860,))"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.shape, y_train.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our testing data has this shape:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((954, 3), (954,))"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_test.shape, y_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the first 5 entries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>eccentricity</th>\n",
       "      <th>semimajoraxis</th>\n",
       "      <th>mass</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1846</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1918</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3261</th>\n",
       "      <td>NaN</td>\n",
       "      <td>0.074</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3000</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1094</th>\n",
       "      <td>0.073</td>\n",
       "      <td>2.600</td>\n",
       "      <td>1.6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      eccentricity  semimajoraxis  mass\n",
       "1846           NaN            NaN   NaN\n",
       "1918           NaN            NaN   NaN\n",
       "3261           NaN          0.074   NaN\n",
       "3000           NaN            NaN   NaN\n",
       "1094         0.073          2.600   1.6"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our y data will be for the same rows as our X:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1846      52.661793\n",
       "1918      23.622591\n",
       "3261       7.008151\n",
       "3000     226.890470\n",
       "1094    1251.000000\n",
       "Name: period, dtype: float64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scaling data\n",
    "### Standardizing with `StandardScaler`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.17649619, -0.5045706 ,  0.14712504, -0.12764807,  1.09368797,\n",
       "        0.67240099, -0.01731216, -0.84296411, -0.18098025, -0.13758824,\n",
       "       -0.00791305, -0.09869129, -0.26808282, -0.18032045, -0.15945662,\n",
       "        0.36484041, -0.15305095, -0.28352985, -0.17803358, -0.18017312,\n",
       "       -0.238978  ,  0.05247717, -0.16875798, -0.26094578, -0.18022437,\n",
       "       -0.22704979, -0.25988606, -0.17954522, -0.2851004 , -0.68678249])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "standardized = StandardScaler().fit_transform(X_train)\n",
    "\n",
    "# examine some of the non-NaN values\n",
    "standardized[~np.isnan(standardized)][:30]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([9.48059029e-03, 3.80041941e-01, 3.33101821e-01, 1.59042751e-01,\n",
       "       1.97830051e+00, 8.58377769e-01, 2.69378660e-01, 4.16484319e-02,\n",
       "       4.99652731e-03, 1.49102579e-01, 8.76699491e-01, 8.72854887e-02,\n",
       "       1.86080019e-02, 5.65632515e-03, 1.27234201e-01, 1.24945296e+00,\n",
       "       3.29258338e-02, 3.16097468e-03, 7.94319727e-03, 5.80365865e-03,\n",
       "       4.77128253e-02, 9.37089717e-01, 1.72188018e-02, 2.57450453e-02,\n",
       "       5.75241221e-03, 5.96410317e-02, 6.24726478e-01, 6.43155558e-03,\n",
       "       1.59042751e-03, 1.97830051e-01])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "standardized = StandardScaler(with_mean=False).fit_transform(X_train)\n",
    "standardized[~np.isnan(standardized)][:30]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-1.37762709e+00, -9.69199710e-02,  1.14837291e+00, -1.28416362e+00,\n",
       "        2.10080029e-01,  5.24837291e+00, -1.74163617e-01, -1.61919971e-01,\n",
       "       -1.41262709e+00, -1.38416362e+00, -1.51997104e-03, -7.70327089e-01,\n",
       "       -2.69696362e+00, -1.40747709e+00, -1.60416362e+00,  7.00800290e-02,\n",
       "       -1.19462709e+00, -2.85236362e+00, -1.38962709e+00, -1.40632709e+00,\n",
       "       -2.40416362e+00,  1.00800290e-02, -1.31722709e+00, -2.62516362e+00,\n",
       "       -1.40672709e+00, -2.28416362e+00, -4.99199710e-02, -1.40142609e+00,\n",
       "       -2.86816362e+00, -1.31919971e-01])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "standardized = StandardScaler(with_std=False).fit_transform(X_train)\n",
    "standardized[~np.isnan(standardized)][:30]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Normalizing with `MinMaxScaler`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([3.93117161e-04, 7.63598326e-02, 1.46646600e-02, 6.08362087e-03,\n",
       "       3.97489540e-01, 3.78290803e-02, 1.03041533e-02, 8.36820084e-03,\n",
       "       1.95372110e-04, 5.70339272e-03, 1.76150628e-01, 3.82427629e-03,\n",
       "       7.11757595e-04, 2.24468882e-04, 4.86689080e-03, 2.51046025e-01,\n",
       "       1.42704129e-03, 1.20883052e-04, 3.25318858e-04, 2.30966220e-04,\n",
       "       1.82506561e-03, 1.88284519e-01, 7.34368621e-04, 9.84761405e-04,\n",
       "       2.28706276e-04, 2.28133939e-03, 1.25523013e-01, 2.58656177e-04,\n",
       "       6.08070051e-05, 3.97489540e-02])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "normalized = MinMaxScaler().fit_transform(X_train)\n",
    "\n",
    "# examine some of the non-NaN values\n",
    "normalized[~np.isnan(normalized)][:30]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using the Median and IQR with `RobustScaler`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.06900269, -0.16197792,  2.09219752,  0.29241339,  1.18200107,\n",
       "        5.60008385,  0.75183146, -0.44653374, -0.09894806,  0.25102438,\n",
       "        0.25566245,  0.45059228, -0.29233062, -0.09454181,  0.15996854,\n",
       "        0.56911162,  0.08756882, -0.35664915, -0.07926968, -0.0935579 ,\n",
       "       -0.17114358,  0.30644472, -0.01732554, -0.2626133 , -0.09390013,\n",
       "       -0.12147676,  0.04377782, -0.08936469, -0.36318861, -0.31520028])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import RobustScaler\n",
    "\n",
    "robust_scaled = RobustScaler().fit_transform(X_train)\n",
    "\n",
    "# examine some of the non-NaN values\n",
    "robust_scaled[~np.isnan(robust_scaled)][:30]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Encoding\n",
    "### Binary encoding with `np.where()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 0, 0, ..., 1, 1, 1])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.where(wine.kind == 'red', 1, 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use the `LabelBinarizer` from scikit-learn:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import LabelBinarizer\n",
    "\n",
    "binary_labels = LabelBinarizer().fit(wine.kind)\n",
    "binary_labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the `classes_` attribute to see the classes labeled and the `inverse_transform()` to see the labels assigned to each value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['red', 'white'], dtype='<U5')"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "binary_labels.inverse_transform(np.array([0, 1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the `Binarizer` for binary encoding of values based on a threshold. Values less than or equal to `threshold` will be 0; values greater than `threshold` will be 1:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1382\n",
       "1     217\n",
       "dtype: int64"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import Binarizer\n",
    "\n",
    "pd.Series(Binarizer(threshold=6).fit_transform(red_wine.quality.values.reshape(-1, 1)).flatten()).value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ordinal Encoding with the `LabelEncoder`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{0, 1, 2}"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "set(LabelEncoder().fit_transform(pd.cut(\n",
    "    red_wine.quality.sort_values(),\n",
    "    bins=[-1, 3, 6, 10],\n",
    "    labels=['low', 'med', 'high']\n",
    ")))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### One-hot encoding with the `OneHotEncoder`\n",
    "In some cases, label encoding may yield some associations that aren't something we want the model to be trained on. A safer strategy is to use one-hot encoding.\n",
    "\n",
    "Our planets data has a `list` column that we can one-hot encode:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Confirmed planets                    3683\n",
       "Controversial                         106\n",
       "Retracted planet candidate             11\n",
       "Solar System                            9\n",
       "Kepler Objects of Interest              4\n",
       "Planets in binary systems, S-type       1\n",
       "Name: list, dtype: int64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "planets.list.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using `pd.get_dummies()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Confirmed planets</th>\n",
       "      <th>Controversial</th>\n",
       "      <th>Kepler Objects of Interest</th>\n",
       "      <th>Planets in binary systems, S-type</th>\n",
       "      <th>Retracted planet candidate</th>\n",
       "      <th>Solar System</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Confirmed planets  Controversial  Kepler Objects of Interest  \\\n",
       "0                  1              0                           0   \n",
       "1                  1              0                           0   \n",
       "2                  1              0                           0   \n",
       "3                  1              0                           0   \n",
       "4                  0              1                           0   \n",
       "\n",
       "   Planets in binary systems, S-type  Retracted planet candidate  Solar System  \n",
       "0                                  0                           0             0  \n",
       "1                                  0                           0             0  \n",
       "2                                  0                           0             0  \n",
       "3                                  0                           0             0  \n",
       "4                                  0                           0             0  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.get_dummies(planets.list).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This gives us a redundant column. Note that we only need one less column than the number of planet lists. Pandas makes it easy to remove one of the columns to address multicollinearity:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Controversial</th>\n",
       "      <th>Kepler Objects of Interest</th>\n",
       "      <th>Planets in binary systems, S-type</th>\n",
       "      <th>Retracted planet candidate</th>\n",
       "      <th>Solar System</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Controversial  Kepler Objects of Interest  \\\n",
       "0              0                           0   \n",
       "1              0                           0   \n",
       "2              0                           0   \n",
       "3              0                           0   \n",
       "4              1                           0   \n",
       "\n",
       "   Planets in binary systems, S-type  Retracted planet candidate  Solar System  \n",
       "0                                  0                           0             0  \n",
       "1                                  0                           0             0  \n",
       "2                                  0                           0             0  \n",
       "3                                  0                           0             0  \n",
       "4                                  0                           0             0  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.get_dummies(planets.list, drop_first=True).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use the `LabelBinarizer`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1, 0, 0, 0, 0, 0],\n",
       "       [1, 0, 0, 0, 0, 0],\n",
       "       [1, 0, 0, 0, 0, 0],\n",
       "       ...,\n",
       "       [1, 0, 0, 0, 0, 0],\n",
       "       [1, 0, 0, 0, 0, 0],\n",
       "       [1, 0, 0, 0, 0, 0]], dtype=int32)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import LabelBinarizer\n",
    "\n",
    "LabelBinarizer().fit_transform(planets.list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imputing\n",
    "### `SimpleImputer`\n",
    "We can fill with the mean, median, `most_frequent` (mode), or a constant value by specifiying the `strategy`. The default is the mean:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1.29      , 19.4       ,  0.231     ],\n",
       "       [ 1.54      , 11.2       ,  0.08      ],\n",
       "       [ 0.83      ,  4.8       ,  0.        ],\n",
       "       ...,\n",
       "       [ 1.61032944,  0.3334    ,  0.31      ],\n",
       "       [ 1.61032944,  0.4       ,  0.27      ],\n",
       "       [ 1.61032944,  0.42      ,  0.16      ]])"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.impute import SimpleImputer\n",
    "\n",
    "SimpleImputer().fit_transform(\n",
    "    planets[['semimajoraxis', 'mass', 'eccentricity']]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Changing to the median is just a matter of passing that as the `strategy`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1.29    , 19.4     ,  0.231   ],\n",
       "       [ 1.54    , 11.2     ,  0.08    ],\n",
       "       [ 0.83    ,  4.8     ,  0.      ],\n",
       "       ...,\n",
       "       [ 0.163518,  0.3334  ,  0.31    ],\n",
       "       [ 0.163518,  0.4     ,  0.27    ],\n",
       "       [ 0.163518,  0.42    ,  0.16    ]])"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.impute import SimpleImputer\n",
    "\n",
    "SimpleImputer(strategy='median').fit_transform(\n",
    "    planets[['semimajoraxis', 'mass', 'eccentricity']]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `MissingIndicator`\n",
    "We can mark where values are missing and use this as a feature in our model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[False, False, False],\n",
       "       [False, False, False],\n",
       "       [False, False, False],\n",
       "       ...,\n",
       "       [ True, False, False],\n",
       "       [ True, False, False],\n",
       "       [ True, False, False]])"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.impute import MissingIndicator\n",
    "\n",
    "MissingIndicator().fit_transform(\n",
    "    planets[['semimajoraxis', 'mass', 'eccentricity']]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Additional Transformers\n",
    "### `FunctionTransformer`\n",
    "With the `FunctionTransformer`, we can use any function on the data. By passing `validate=True`, we will convert the result to two-dimensional numpy array and raise an error if there is an issue:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0.073 , 2.6   , 1.6   ],\n",
       "       [0.38  , 6.7   , 2.71  ],\n",
       "       [0.008 , 0.039 , 1.5   ],\n",
       "       ...,\n",
       "       [0.4   , 1.9   , 5.3   ],\n",
       "       [0.2   , 0.172 , 0.0162],\n",
       "       [0.249 , 4.62  , 2.99  ]])"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import FunctionTransformer\n",
    "\n",
    "FunctionTransformer(\n",
    "    np.abs, validate=True\n",
    ").fit_transform(X_train.dropna())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `ColumnTransformer`\n",
    "Sometimes we don't want to perform the same transformation on all of our features, the `ColumnTransformer` lets us specify which tranformations to use on each column. We pass a list of tuples in the form (name, transformer object, columns to apply to):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[            nan,             nan,             nan,\n",
       "         1.69919971e-01,  2.88416362e+00],\n",
       "       [-7.91305129e-03, -9.86912907e-02,  7.11757595e-04,\n",
       "         1.68400000e-01,  1.87200000e-01],\n",
       "       [            nan,             nan,             nan,\n",
       "         1.69919971e-01,  2.88416362e+00],\n",
       "       [            nan, -1.80320454e-01,  4.86689080e-03,\n",
       "         1.69919971e-01,  1.28000000e+00],\n",
       "       [ 3.64840414e-01, -1.53050946e-01,  1.20883052e-04,\n",
       "         2.40000000e-01,  3.18000000e-02]])"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.compose import ColumnTransformer \n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
    "\n",
    "ColumnTransformer(\n",
    "    [\n",
    "        ('standard_scale', StandardScaler(), [0, 1]),\n",
    "        ('min_max', MinMaxScaler(), [2]),\n",
    "        ('impute', SimpleImputer(), [0, 2])\n",
    "    ]\n",
    ").fit_transform(X_train)[15:20] "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the `make_column_transformer()` which will name the transformers for us:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-0.49212919, -0.00209303, -0.22454741, ...,  1.        ,\n",
       "         0.        ,  0.        ],\n",
       "       [-0.40533358, -0.68276379, -0.23169035, ...,  1.        ,\n",
       "         0.        ,  0.        ],\n",
       "       [-0.55610666, -1.04338405, -0.23162366, ...,  1.        ,\n",
       "         0.        ,  0.        ],\n",
       "       ...,\n",
       "       [-0.5979688 , -0.68276379, -0.76124811, ...,  0.        ,\n",
       "         0.        ,  1.        ],\n",
       "       [-0.62883808, -0.88110493,  0.40907393, ...,  1.        ,\n",
       "         0.        ,  0.        ],\n",
       "       [-0.5785095 , -1.04338405, -0.20563417, ...,  1.        ,\n",
       "         0.        ,  0.        ]])"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.compose import make_column_transformer\n",
    "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
    "\n",
    "categorical = [\n",
    "    col for col in planets.columns \\\n",
    "    if col in [\n",
    "        'list', 'name', 'description', \n",
    "        'discoverymethod', 'lastupdate'\n",
    "    ]\n",
    "]\n",
    "numeric = [col for col in planets.columns if col not in categorical]\n",
    "\n",
    "make_column_transformer(\n",
    "    (StandardScaler(), numeric),\n",
    "    (OneHotEncoder(sparse=False), categorical)\n",
    ").fit_transform(planets.dropna())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `Pipeline`\n",
    "Using pipelines ensures the whole model training and testing process is consistent. To make a pipeline, we pass in a list of steps as tuples of (name, object):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(memory=None,\n",
       "     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('lr', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n",
       "         normalize=False))])"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "Pipeline([('scale', StandardScaler()), ('lr', LinearRegression())])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use the `make_pipeline()` function to make the pipeline without naming the steps ourselves:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(memory=None,\n",
       "     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linearregression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n",
       "         normalize=False))])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "make_pipeline(StandardScaler(), LinearRegression())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
